This R Markdown document includes contributions from Professor Bret Larget

Setup details

Binomial Distribution

The BINS acronym for binomial assumptions

A binomial random variable satisfies the following properties:

  • B = binary outcomes (each trial may be categorized with two outcomes, conventionally labeled success and failure)
  • I = independence (results of trials do not affect probabilities of other trial outcomes)
  • N = fixed sample size (the sample size is pre-specified and not dependent on outcomes)
  • S = same probability (each trial has the same probability of success)

Binomial Parameters

Binomial Moments
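
For \(X \sim \text{Binomial}(n, p)\), the mean, variance, and standard deviation are

\[ \mathsf{E}(X) = np, \qquad \mathsf{Var}(X) = np(1-p), \qquad \mathsf{SD}(X) = \sqrt{np(1-p)} \]

The simulation below checks these values for \(n = 100\) and \(p = 0.5\), where \(np = 50\) and \(np(1-p) = 25\).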

B = 1000000
n = 100
p = 0.5

binomial_example = tibble(
  x = rbinom(B, n, p)
)

## use as.data.frame to print more digits
binomial_example %>% 
  summarize(mean = mean(x),
            variance = var(x),
            sd = sd(x)) %>% 
  as.data.frame()
##       mean variance       sd
## 1 49.99191 24.98402 4.998402
ggplot(binomial_example, aes(x=x)) +
  geom_histogram(aes(y = after_stat(density)),
                 center = 50, binwidth = 1,
                 color = "black", fill = "firebrick") +
  ylab("Probability") +
  ggtitle("Simulated Binomial Distribution",
          subtitle = "n = 100, p = 0.5") +
  theme_minimal()

binomial_example %>% 
  count(x) %>% 
  ungroup() %>% 
  mutate(p = n/sum(n)) %>% 
ggplot(aes(x=x, y=p, xend=x, yend=0)) +
  geom_segment(color = "blue") +
  geom_hline(yintercept = 0) +
  ylab("Probability") +
  ggtitle("Simulated Binomial Distribution",
          subtitle = "n = 100, p = 0.5") +
  theme_minimal()

Explore Binomial Graphs

n_options = 2^seq(0, 10, by = 1)
p_options = c(0.1, 0.5, 0.9)

binomial_plots = expand_grid(n_options, p_options) %>% 
  pmap(~{
    gbinom(.x, .y,
           color = if_else(.y == 0.1, "red", 
                           if_else(.y == 0.5, "purple", "blue")),
           scale = TRUE) +
      theme_minimal()
  })

ggarrange(plots = binomial_plots,
          ncol = length(p_options),
          nrow = length(n_options))

Binomial Probabilities

\[ P(X = k) = \binom{n}{k} p^k(1-p)^{n-k}, \qquad \text{for $k=0,1,\ldots,n$} \]

\[ \binom{n}{k} = \frac{n!}{k!(n-k)!} \]

Example

  • The R function choose() calculates \(\binom{n}{k}\)
  • The R function factorial() calculates \(n!\)
    • This calculation is not stable for moderately large values: factorial(n) overflows to Inf for n > 170 in double precision.
    • A more stable option is lgamma() where lgamma(x+1) is equal to \(\ln x!\).
  • But we will use the built-in function dbinom() to calculate binomial probabilities without worrying about the formula or numerical instabilities.
n = 5
p = 0.5
k = 2

choose(n, k)
## [1] 10
factorial(n)/(factorial(k)*factorial(n-k)) # n! /(k!(n-k)!)
## [1] 10
choose(n, k)*p^k*(1-p)^(n-k)
## [1] 0.3125
dbinom(k,n,p) # see below
## [1] 0.3125
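
The lgamma() approach mentioned above can be sketched with a hand-rolled function (for illustration only; dbinom() is what we will actually use):

```r
# Log-scale binomial probability using lgamma(); recall lgamma(x + 1) = ln(x!)
log_dbinom = function(k, n, p) {
  log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
  log_choose + k*log(p) + (n - k)*log(1 - p)
}
exp(log_dbinom(2, 5, 0.5)) # matches dbinom(2, 5, 0.5) = 0.3125
```

Working on the log scale avoids the overflow that factorial() suffers for large n.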

R Binomial Functions

Demonstrations

rbinom()

# Simulate binomial random variables
n = 5
p = 0.5
x = rbinom(6, n, p)
x
## [1] 4 2 3 3 4 3
mean(x) # estimate based on sample
## [1] 3.166667
n*p # true mean
## [1] 2.5
var(x) # estimate based on sample
## [1] 0.5666667
n*p*(1-p) # true variance
## [1] 1.25

dbinom()

## Binomial "density" calculations
dbinom(0:n, n, p)
## [1] 0.03125 0.15625 0.31250 0.31250 0.15625 0.03125
dbinom(4, n, p)
## [1] 0.15625
choose(n, 4)*p^4*(1-p)^(n-4)
## [1] 0.15625

gbinom()

  • This is not a base R command, but is defined in the script ggprob.R.
## Binomial "density" plot
gbinom(n,p) 

pbinom()

## Binomial distribution calculations
## P(X <= x) = F(x), where F is the distribution function.

# P(X <= 3):
pbinom(3, n, p)
## [1] 0.8125
dbinom(0, n, p) + dbinom(1, n, p) + dbinom(2, n, p) + dbinom(3, n, p)
## [1] 0.8125
1 - dbinom(4, n, p) - dbinom(5, n, p)
## [1] 0.8125
# P(X > 3):
1 - pbinom(3, n, p) # 1 - P(X <= 3)
## [1] 0.1875
pbinom(3, n, p, lower.tail=FALSE) # P(X > 3)
## [1] 0.1875
dbinom(4, n, p) + dbinom(5, n, p)
## [1] 0.1875
# P(X < 3):
pbinom(3, n, p) - dbinom(3, n, p)
## [1] 0.5
pbinom(2, n, p) # P(X <= 2) = P(X < 3)
## [1] 0.5

qbinom()

## Binomial quantile calculations
## Docs: The quantile is defined as the smallest value x such that F(x) >= p, where F(x) = P(X <= x) is the distribution function.
qbinom(.2, n, p) # smallest x such that P(X <= x) >= 0.2; no x gives exactly 0.2
## [1] 2
dbinom(0, n, p) + dbinom(1, n, p) + dbinom(2, n, p)
## [1] 0.5
dbinom(0, n, p) + dbinom(1, n, p) 
## [1] 0.1875
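
The quantile definition can be checked directly by scanning the cumulative probabilities (a quick sketch):

```r
# Smallest x with F(x) >= 0.2, found by scanning the CDF on the grid 0:n
n = 5
p = 0.5
cdf = pbinom(0:n, n, p)
min(which(cdf >= 0.2)) - 1 # returns 2, agreeing with qbinom(0.2, n, p)
```

The subtraction of 1 is needed because the grid starts at x = 0 while R vectors are indexed from 1.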

Problems

Problem 1

What are the mean and standard deviation of the \(\text{Binomial}(90, 0.7)\) distribution?

Live Coding

Solution

The mean is \(\mu = (90)(0.7) = 63\).

The standard deviation is \(\sigma = \sqrt{90(0.7)(0.3)} = 4.347413\).
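
The arithmetic can be verified in R:

```r
# Mean and standard deviation of the Binomial(90, 0.7) distribution
n = 90
p = 0.7
n*p               # mean: 63
sqrt(n*p*(1 - p)) # standard deviation: 4.347413
```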

Problem 2

Plot the \(\text{Binomial}(90, 0.7)\) distribution. Add a red partly transparent solid vertical line at the mean, dashed vertical lines one standard deviation above and below the mean, and dotted lines two standard deviations above and below the mean.

Live Coding

Solution

n = 90
p = 0.7
mu = n*p
sigma = sqrt(n*p*(1-p))

gbinom(n, p, scale = TRUE) +
  geom_vline(xintercept = mu, color = "red", alpha = 0.5) +
  geom_vline(xintercept = mu + c(-1,1)*sigma,
             color = "red", linetype = "dashed") +
  geom_vline(xintercept = mu + c(-2,2)*sigma,
             color = "red", linetype = "dotted") +
  theme_minimal()

Problem 3

If \(X \sim \text{Binomial}(90, 0.7)\), what is \(P(X = \mu)\)?

Live Coding

Solution

dbinom(mu, n, p)
## [1] 0.09144637

Problem 4

If \(X \sim \text{Binomial}(90, 0.7)\), what is \(P(\mu - \sigma \leq X \leq \mu + \sigma)\)?

Live Coding

Solution

## endpoints
mu + c(-1,1)*sigma
## [1] 58.65259 67.34741
## Within one standard deviation
## P(mu - sigma <= X <= mu + sigma)
## P(59 <= X <= 67) = P(X <= 67) - P(X <= 58)
pbinom(67, n, p) - pbinom(58, n, p)
## [1] 0.6995895
## Sum of probabilities
sum( dbinom(59:67, n, p) )
## [1] 0.6995895

Problem 5

Find the 0.05 and 0.95 quantiles of the \(\text{Binomial}(90, 0.7)\) distribution.

Live Coding

Solution

qbinom(c(0.05, 0.95), n, p)
## [1] 56 70
## Compare to the tail probabilities
pbinom(56, n, p) # P(X <= 56) >= 0.05
## [1] 0.06954663
pbinom(55, n, p) # P(X <= 55) < 0.05
## [1] 0.04457106
## Compare to the tail probabilities
pbinom(70, n, p) # P(X <= 70) >= 0.95
## [1] 0.9609796
pbinom(69, n, p) # P(X <= 69) < 0.95
## [1] 0.9354706

Problem 6

Explain why each of the following random variables does not have a binomial distribution.

A

\(X_1\) is the number of times a fair coin is tossed until we have observed at least one head and one tail.

Solution

There is not a fixed number of trials.

B

\(X_2\) is the number of red balls drawn when one is picked from a bucket with 10% red balls, a second is picked from a bucket with 20% red balls, and a third is drawn from a bucket with 30% red balls.

Solution

The value of \(p\) is not the same for all trials.

C

\(X_3\) is the number of red balls selected in a sample of 5 chosen without replacement from a bucket with ten red balls and ten white balls.

Solution

The trials are not independent. The color of the first ball selected affects the probability of the color of the second ball, for example.

Problem 7

7A

Make a line plot of the 0.75 quantile of binomial distributions with \(p=0.7\) and \(n\) varying from 1 to 500 with \(n\) on the x axis and the quantile on the y axis. Describe the shape of the relationship between these variables.

Solution

prob7 = tibble(
  n = 1:500,
  q75 = qbinom(0.75, n, 0.7))

ggplot(prob7, aes(x=n, y=q75)) +
  geom_line()

  • The relationship is approximately linear.

7B

If \(q_{0.75}(n)\) is the 0.75 quantile of the binomial distribution with \(p = 0.7\), define \(z = (q_{0.75}(n) - np)/\sqrt{np(1-p)}\). (Subtract the binomial mean and divide by the binomial standard deviation.) Make a line plot of \(z\) on the y axis and \(n\) varying from 1 to 500 on the x axis. Add a straight regression line to the plot. Describe the plot.

Solution

prob7 = prob7 %>% 
  mutate(mu = n*0.7,
         sigma = sqrt(n*0.7*0.3),
         z = (q75 - mu)/sigma)

ggplot(prob7, aes(x=n, y = z)) +
  geom_line() +
  geom_smooth(method = "lm")

  • The pattern oscillates around a nearly horizontal line with the size of the oscillations decreasing as \(n\) increases.
prob7 %>% 
  summarize(z = mean(z))
## # A tibble: 1 × 1
##       z
##   <dbl>
## 1 0.683
  • The mean value of \(z\) is approximately 0.68.
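
This average is consistent with the normal approximation to the binomial: as \(n\) grows, the standardized binomial 0.75 quantile settles near the standard normal 0.75 quantile.

```r
# Standard normal 0.75 quantile, for comparison with the mean z above
qnorm(0.75) # about 0.674
```

The small remaining gap (0.68 versus 0.674) reflects the discreteness of the binomial distribution.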

Normal Distributions

Overview

\[f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}\]

mu = 3
sigma = 0.5
mu2 = 7
sigma2 = 2

gnorm(mu=mu, sigma=sigma, a=0, b=12, color="blue",
      fill=NULL, title=FALSE) +
  geom_norm_density(mu = mu2, sigma = sigma2, color = "magenta") +
  theme_minimal()

Normal Probabilities

\[P(a \leq X \leq b)\]

Examples:

Area to the left

  • Find \(\mathsf{P}(X < 80)\)
## naming arguments; I usually do not
pnorm(q=80, mean=100, sd=25)
## [1] 0.2118554
## without the names
pnorm(80, 100, 25)
## [1] 0.2118554
## gnorm is defined in ggprob.R
## you need to call source("../../scripts/ggprob.R") in set-up for this to work
## a is the left endpoint of the interval to fill
## b is the right endpoint of the interval to fill
## use NULL for the default value (all the way to the left or right)
gnorm(mu = 100, sigma = 25) +
  geom_norm_fill(mu = 100, sigma=25, a = NULL, b = 80) +
  theme_minimal()

Area to the right

  • Find \(\mathsf{P}(X > 90)\)
## using lower.tail
pnorm(90, 100, 25, lower.tail = FALSE)
## [1] 0.6554217
## using subtraction
1 - pnorm(90, 100, 25)
## [1] 0.6554217
## first argument is mu, second is sigma if you do not want to specify names
## Need to repeat mu and sigma in each layer: they are not set in aes()
gnorm(100, 25) +
  geom_norm_fill(100, 25, a = 90, b = NULL) +
  theme_minimal()

Area between two values

  • Find \(\mathsf{P}(50 < X < 120)\)
    • Note: \(\mathsf{P}(50 < X < 120) = \mathsf{P}(X < 120) - \mathsf{P}(X < 50)\)
## use subtraction
pnorm(120, 100, 25) - pnorm(50, 100, 25)
## [1] 0.7653945
gnorm(100, 25) +
  geom_norm_fill(100, 25, a = 50, b = 120) +
  theme_minimal()

Two tail areas

  • Find \(\mathsf{P}(|X-100| > 30)\)
    • This is \(\mathsf{P}(X < 70) + \mathsf{P}(X > 130)\)
## using lower.tail
pnorm(70, 100, 25) + pnorm(130, 100, 25, lower.tail = FALSE)
## [1] 0.2301393
## using subtraction
pnorm(70, 100, 25) + (1 - pnorm(130, 100, 25))
## [1] 0.2301393
gnorm(100, 25) +
  geom_norm_fill(100, 25, a = NULL, b = 70) +
  geom_norm_fill(100, 25, a = 130, b = NULL) +
  theme_minimal()

Standard Normal Distribution and Standardization

gnorm() +
  theme_minimal()

Example

\(X \sim \text{N}(100, 25)\)

  • Find \(\mathsf{P}(X < 80)\)
mu = 100
sigma = 25
x = 80
z = (x-mu)/sigma

## Using X directly
pnorm(x, mu, sigma)
## [1] 0.2118554
## Standardizing first
## default values for the mean and sd are 0 and 1
pnorm(z)
## [1] 0.2118554
gnorm(mu, sigma) +
  geom_norm_fill(mu, sigma, b=x) +
  theme_minimal()

gnorm() +
  geom_norm_fill(b=z) +
  theme_minimal()

  • Notice the graphs are identical except for axis labels
  • All normal curves have exactly the same shape, but they may differ in center and scale

Normal Quantiles

Examples

\(X \sim \text{N}(100, 25)\)

Find the 0.1, 0.25, 0.9, 0.95, 0.975, and 0.99 quantiles of the distribution.

p = c(0.1, 0.25, 0.9, 0.95, 0.975, 0.99)
mu = 100
sigma = 25

qnorm(p, mu, sigma)
## [1]  67.96121  83.13776 132.03879 141.12134 148.99910 158.15870

Graphs

## use ggarrange() to display multiple plots and map() to create a list of plots

normal_plots = p %>% 
  map(~{
    gnorm(mu, sigma) +
    geom_norm_fill(mu, sigma, a = NULL, b = qnorm(.x, mu, sigma)) +
    theme_minimal()
  })

ggarrange(plots = normal_plots,
          nrow = length(p),
          ncol = 1)

Simple Normal Calculation example

The weights of packets of cookies produced by a certain manufacturer have a normal distribution with a mean of 202 g and a standard deviation of 3 g. What is the weight that should be labeled on the packet so that only 1% of the packets are underweight?

  • \(X \sim \text{N}(202, 3)\), want \(P(X < a) = 0.01\)
## This gives the 1st percentile of a N(202, 3) distribution
qnorm(0.01, mean=202, sd=3, lower.tail=TRUE)
## [1] 195.021
## Check
pnorm(195.021, mean=202, sd=3) # P(X <= 195.021) = .01
## [1] 0.01000039

Normal approximation to the binomial distribution

## Assumptions satisfied
n = 100
p = 0.5
mu = n*p
sigma = sqrt(n*p*(1-p))

gbinom(n, p, scale=TRUE) +
  geom_norm_density(mu, sigma, color = "red") +
  theme_minimal()

Example where the approximation is poor

  • Here is an example where the assumption is not satisfied.
    • \(n = 100\)
    • \(p = 0.01\)
    • \(np(1-p) = 0.99\)
## Criteria not satisfied
n = 100
p = 0.01
mu = n*p
sigma = sqrt(n*p*(1-p))

gbinom(n, p, scale=TRUE) +
  geom_norm_density(mu, sigma, color = "red") +
  theme_minimal()

Correction for Continuity

  • If we want to calculate a binomial probability, we should just do so directly

  • However, if we do want to use a normal approximation, the calculation will be more accurate if we add or subtract 0.5 as appropriate.

  • The exact binomial probability that \(X=x\) is approximated by the area under a normal density curve between \(x-0.5\) and \(x+0.5\).

  • So, for example, when \(n = 20\) and \(p = 0.4\), we can compare the exact binomial probability \(\mathsf{P}(X \le 6)\) with its normal approximations.

## Exact calculation
pbinom(6, 20, 0.4)
## [1] 0.2500107
  • For the normal approximation, note that \(\mathsf{P}(X \le 6) = \mathsf{P}(X < 7)\).
    • The area under a normal curve to the left of 6 misses part of the binomial probability exactly at 6 and is too small.
    • The area under a normal curve to the left of 7 includes part of the binomial probability exactly at 7 and is too big.
## Normal approximations without correction

## moments
mu = 20*0.4
sigma = sqrt(20*0.4*0.6)

## approximations
pnorm(6, mu, sigma)
## [1] 0.1806552
pnorm(7, mu, sigma)
## [1] 0.3240384
  • A better approximation uses 6.5 as the endpoint for the normal approximation.
pnorm(6.5, mu, sigma)
## [1] 0.2467814
  • But the exact calculation is the most accurate, of course.
cc_example = tibble(
  x = 0:20,
  p = dbinom(x, 20, 0.4)
)
  • Approximation too small
ggplot(cc_example, aes(x = x, y = p)) +
  geom_col(fill = "gray") +
  geom_norm_fill(mu = mu, sigma = sigma, b = 6, alpha = 0.5) +
  geom_norm_density(mu, sigma, color = "blue") +
  geom_hline(yintercept = 0) +
  theme_minimal()

  • Approximation too large
ggplot(cc_example, aes(x = x, y = p)) +
  geom_col(fill = "gray") +
  geom_norm_fill(mu = mu, sigma = sigma, b = 7, alpha = 0.5) +
  geom_norm_density(mu, sigma, color = "blue") +
  geom_hline(yintercept = 0) +
  theme_minimal()

  • Approximation better
ggplot(cc_example, aes(x = x, y = p)) +
  geom_col(fill = "gray") +
  geom_norm_fill(mu = mu, sigma = sigma, b = 6.5, alpha = 0.5) +
  geom_norm_density(mu, sigma, color = "blue") +
  geom_hline(yintercept = 0) +
  theme_minimal()
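
The continuity-corrected approximation above can be wrapped in a small helper (a sketch; this function name is our own, not part of base R):

```r
# Normal approximation to P(X <= x) for Binomial(n, p), with continuity correction
pbinom_normal = function(x, n, p) {
  mu = n*p
  sigma = sqrt(n*p*(1 - p))
  pnorm(x + 0.5, mu, sigma) # area to the left of x + 0.5
}
pbinom_normal(6, 20, 0.4) # about 0.2468; exact pbinom(6, 20, 0.4) is about 0.2500
```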

Setup details

  • You will need the package tidyverse for this file

  • Also, install the broman package for the myround() function which actually rounds output to the requested number of digits (as a string).

  • This lecture uses the following scripts, assumed to be in your course scripts directory.

    • COURSE/scripts/viridis.R
    • COURSE/scripts/ggprob.R
  • The file ggprob.R contains a number of functions for graphing probability distributions in a tidyverse-friendly manner.

Simulation and Probability

Example 1

  • We begin with an example where we have learned a method.
  • Suppose that a random variable \(X_1 \sim \text{Normal}(250, 30)\), meaning that the mean of the distribution is \(\mathsf{E}(X_1) = \mu = 250\) and the standard deviation is \(\sigma = 30\).
  • Suppose also we were interested in calculating \(\mathsf{P}(X > 300)\).
  • From a previous lecture, we know we just need to find the area to the right of 300 under the normal density for this distribution.
  • The base R function pnorm() can give us a numerical answer.
1 - pnorm(300, 250, 30)
## [1] 0.04779035
  • A graph helps to visualize the answer.
gnorm(250, 30) +
  geom_norm_fill(250, 30, a = 300)

Simulation Solution

  • Now, let’s say we forgot about the existence of pnorm(), but we did know how to use rnorm() to sample random variables.

  • The idea would be to sample many random variables with this distribution and merely count the proportion that are larger than 300.

  • First, we do a tidyverse solution where we create a data frame with the random sample in a column and then use dplyr commands to calculate the desired summary.

    • We will use a simulation size of 1,000,000 and repeat a few times to get a sense of the error due to simulation.
## setting the seed here so that we get the same values each time we knit
## We will only set the seed once in this file
set.seed(2022)

B = 1000000
prob1 = tibble(
  x1 = rnorm(B, 250, 30),
  x2 = rnorm(B, 250, 30),
  x3 = rnorm(B, 250, 30),
  x4 = rnorm(B, 250, 30))

prob1_prob = prob1 %>% 
  summarize(p1 = mean(x1 > 300),
            p2 = mean(x2 > 300),
            p3 = mean(x3 > 300),
            p4 = mean(x4 > 300))
prob1_prob
## # A tibble: 1 × 4
##       p1     p2     p3     p4
##    <dbl>  <dbl>  <dbl>  <dbl>
## 1 0.0477 0.0478 0.0480 0.0479
  • It appears that if we round off our answers to four decimal places, different simulations give slightly different answers.
  • Rounding to three decimal places, we are reasonably confident that the probability is 0.048.
  • Compare again to the exact answer rounded to four decimal places.
round(1 - pnorm(300, 250, 30), 4)
## [1] 0.0478
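
The size of the simulation error can be anticipated in advance (a sketch using the standard error of a sample proportion):

```r
# Standard error of a simulated proportion: sqrt(p(1-p)/B)
B = 1000000
p_hat = 0.0478
sqrt(p_hat*(1 - p_hat)/B) # about 0.0002, so the third decimal place is trustworthy
```

This explains why the four simulations above agree to three decimal places but wobble in the fourth.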

Base R Solution

  • We could have repeated the same calculation using base R.
  • In this course, we tend to show tidyverse solutions most of the time, but knowing how to do similar calculations in base R is helpful.
## A base R solution
prob1_base = mean( rnorm(B, 250, 30) > 300 )
round(prob1_base, 4)
## [1] 0.0478

Example 2

  • Next, let’s try another example where we have theory to tell us the exact number.

  • Let \(X_2 \sim \text{Binomial}(500, 0.2)\) so the mean is \(\mathsf{E}(X_2) = 500 \times 0.2 = 100\) and the standard deviation is \(\sigma = \sqrt{500(0.2)(0.8)} \doteq 8.94\).

  • Find \(\mathsf{P}(X \ge 120)\).

  • As a review, let’s do the calculation exactly first with pbinom().

## Notice P(X >= 120) = 1 - P(X <= 119)
prob2 = 1 - pbinom(119, 500, 0.2)
prob2
## [1] 0.0160918
  • Here is a simulation based solution
B = 1000000
prob2_df = tibble(
  x = rbinom(B, 500, 0.2)
)

prob2_sim = prob2_df %>% 
  summarize(prob2 = mean(x >= 120)) %>% 
  pull(prob2)

prob2_sim
## [1] 0.016149
  • In this example, the simulated and exact values agree when rounded to four decimal places, but that may have been a bit lucky, as we might easily have been off in the fourth digit.
    • The simulation was large enough, it appears, that repeated simulations would very likely be accurate to three decimal places.

Example 3

  • Here is an example that delves into random sampling.

  • Suppose that \(X_1, X_2, \ldots, X_{10}\) each have a \(\text{Normal}(100, 20)\) distribution and that they are mutually independent.

    • Informally, mutually independent means that the realized values of any of the random variables have no effect on the probabilities of the values of the others.
  • Let \(\bar{X}\) be the sample mean, \(\bar{X} = (X_1 + \cdots + X_{10})/10\).

  • Find \(\mathsf{P}(95 < \bar{X} < 105)\).

  • We may write this set-up as \(X_i \overset{\text{iid}}{\sim} \text{Normal}(100, 20), \text{ for } i = 1,\ldots,10\).

    • The text “iid” is short for independent and identically distributed, a phrase used often when studying probability and mathematical statistics more formally.

Solution

  • If we had some theory that told us that \(\bar{X}\) had a normal distribution with a mean and standard deviation we could calculate, it would be possible to answer this question with a direct calculation.
  • In the absence of this theory, let’s do a simulation.
  • In this solution, we will use the function map_dbl() which is from the tidyverse package purrr which is used for iteration.
    • The subscript _dbl specifies that the return value is a vector of numerical values (double precision floating point values).
  • The first argument will be the items we want to iterate over, in this case a vector of length \(B = 1,000,000\).
  • The second argument is a function to call.
    • Here, we just need to repeatedly take the mean of a random sample of 10 normal random variables, which we can do with mean(rnorm(10, 100, 20)).
  • The code chunk below simulates each sample and calculates and saves the mean, without retaining the individually sampled variables.
    • We save the means of the simulated samples in a single column of a tibble.
    • It takes a few seconds to run.
B = 1000000
prob3_df = tibble(
  xbar = map_dbl(1:B, ~mean(rnorm(10, 100, 20))))
  • Then compute the observed proportion of sample means between 95 and 105.
prob3 = prob3_df %>% 
  summarize(p = mean( between(xbar, 95, 105))) %>% 
  pull(p)

prob3
## [1] 0.571454
  • We see that the answer is close to 0.571.
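
The same simulation can be sketched in base R with replicate(), without building a data frame (a smaller B is used here to keep the sketch fast):

```r
# Base R alternative: replicate the sample-mean simulation directly
B = 10000
xbar = replicate(B, mean(rnorm(10, 100, 20)))
mean(xbar > 95 & xbar < 105) # close to 0.571, up to simulation error
```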

Visualize

  • We will visualize the answer by graphing an estimated density of the sample in blue, and then overlaying a normal density with the mean and standard deviation which match these calculated summary statistics, and shading the area between 95 and 105.
prob3_sum = prob3_df %>% 
  summarize(mu = mean(xbar),
            sigma = sd(xbar))

mu = prob3_sum %>% 
  pull(mu)

sigma = prob3_sum %>% 
  pull(sigma)

mu
## [1] 100.0044
sigma
## [1] 6.322996
  • Not surprisingly, the sample mean of the 1,000,000 simulated sample means is close to 100, the mean of each \(X_i\).
  • The standard deviation is less than \(\sigma = 20\) but does not look close to any simple number
    • (More on this soon!)
  • Here is the plot.
ggplot(prob3_df, aes(x = xbar)) +
  geom_density(color = "blue") +
  geom_norm_density(mu, sigma, color = "red") +
  geom_norm_fill(mu, sigma, a = 95, b = 105, fill = "red") +
  geom_hline(yintercept = 0) +
  xlab("Sample Mean") +
  ylab("Density") + 
  ggtitle("Example 3: normal sample mean")

Observations

  • Notice that the estimated density curve is nearly a perfect match for the theoretical normal density with these moments.

  • This is not a coincidence:

    • When a random sample of size \(n\) is sampled from a normal distribution with mean \(\mu\) and standard deviation \(\sigma\), the sample mean \(\bar{X}\) has a normal distribution with mean \(\mu\) and standard deviation \(\sigma / \sqrt{n}\).
    • We write this succinctly as \(\bar{X} \sim \text{Normal}(\mu, \sigma / \sqrt{n})\)
  • For our numerical example, \(\text{SD}(\bar{X}) = 20 / \sqrt{10} \doteq 6.325\) which is very close to our observed sample standard deviation of 6.323.

  • If we had used theory and not simulation, we could have calculated the probability like this.

pnorm(105, 100, 20 / sqrt(10)) - pnorm(95, 100, 20 / sqrt(10))
## [1] 0.5708047

Example 4

  • Suppose that the random variables \(U_1, \ldots, U_{20}\) are each uniformly distributed between 0 and 10: \(U_i \overset{\text{iid}}{\sim} \text{Uniform}(0,10) \text{ for } i = 1, \ldots, 20\).
  • What is the probability that the sum \(S_{20} = U_1 + \cdots + U_{20}\) is larger than 115?

Solution

  • The \(\text{Uniform}(0,10)\) distribution has a density equal to 0.1 between 0 and 10 and is 0 elsewhere.
unif_0_10 = tibble(x = c(-1, 0, 10, 11),
                   y = c(0, 0.1, 0.1, 0))

ggplot(unif_0_10, aes(x=x, y=y)) +
  geom_step(color = "blue") +
  geom_hline(yintercept = 0) +
  xlab("U") +
  ylab("Density") +
  ggtitle("Uniform(0,10) Distribution")

  • Let’s solve using simulation
    • Generate a large number \(B\) of samples of size 20 from this uniform distribution
    • Find the sum of the 20 values in each sample
    • Save these \(B\) sample sums
    • Calculate the proportion that are larger than 115.
B = 1000000

prob4_df = tibble(
  s20 = map_dbl(1:B, ~sum(runif(20, 0, 10))))
  • Estimate the probability
p4 = prob4_df %>% 
  summarize(p = mean(s20 > 115)) %>% 
  pull(p)

p4
## [1] 0.123557
  • The probability estimated by simulation is about 0.124.

Theory

  • An exact theoretical calculation is quite difficult.
  • But, let’s take a look at the estimated density of the sampled sums and compare to a normal distribution where the mean and standard deviation match the observed values.
prob4_sum = prob4_df %>% 
  summarize(mu = mean(s20),
            sigma = sd(s20),
            p = mean(s20 > 115))
prob4_sum
## # A tibble: 1 × 3
##      mu sigma     p
##   <dbl> <dbl> <dbl>
## 1  100.  12.9 0.124
mu = prob4_sum$mu
sigma = prob4_sum$sigma

mu
## [1] 99.99506
sigma
## [1] 12.91052

Plot

ggplot(prob4_df, aes(x = s20)) +
  geom_density(color = "blue") +
  geom_norm_density(mu, sigma, color = "red") +
  geom_norm_fill(mu, sigma, a = 115, fill = "red") +
  geom_hline(yintercept = 0) +
  xlab("Sum") +
  ylab("Density") +
  ggtitle("Sum of Uniforms and Normal Approximation")

  • Again, the exact probability is well-approximated by an area under a normal density curve because the exact distribution of the sum of 20 uniform random variables is nearly normal.
    • This is not a coincidence!

Theory

  • The exact distribution of the sum of uniform random variables is not normal in general.
  • But:
    • \(\mathsf{E}(S_{20})\) is exactly equal to the sum of the means \(\mathsf{E}(U_i)\), or \(20 \times 5 = 100\); and
    • \(\mathsf{Var}(S_{20})\) is exactly equal to the sum of the variances of the \(\{U_i\}\): each \(\mathsf{Var}(U_i) = (10-0)^2/12 = 100/12\), so \(\mathsf{Var}(S_{20}) = 20 \times 100/12 \doteq 166.67\), meaning that the exact standard deviation is \(\mathsf{SD}(S_{20}) = \sqrt{2000/12} \doteq 12.9\).
  • Using the normal approximation,
1 - pnorm(115, 100, sqrt(2000/12))
## [1] 0.1226391
  • So the simulated probability differed from the theoretical value by about one in the third significant digit, a small relative error.

Central Limit Theorem

  • The past two problems highlight a fundamental result from probability and mathematical statistics called the central limit theorem.
  • A rigorous statement of the theorem states how the exact probability distribution of a sample mean converges to a normal distribution as the sample size increases to infinity.
  • A practical statement is as follows:

The probability distribution of the sample mean of an iid random sample of size \(n\) from a distribution with mean \(\mu\) and standard deviation \(\sigma\) has mean \(\mu\) and standard deviation \(\sigma/\sqrt{n}\). Furthermore, if \(n\) is sufficiently large, this distribution of the sample mean is approximately normal.

  • In practice, \(n\) need not be all that large.
  • If the base distribution is fairly symmetric, then the normal approximation can be very good for small to moderate \(n\) (say, a few dozen or less).
  • However, if the base distribution is very skewed or has a very high probability on a single value, then \(n\) may need to be substantially larger before the normal approximation is accurate.

Example 5

  • Here is an example where the normal approximation fails, but simulation can still obtain an accurate value.

  • The gamma distribution is the distribution of a positive continuous random variable with density function \(f(x) = C x^{\alpha-1} \mathrm{e}^{-\lambda x}, \ x > 0\), where \(\alpha>0\) is the shape parameter, \(\lambda>0\) is the rate parameter, and \(C\) is a normalizing constant (depending on \(\alpha\) and \(\lambda\), but not \(x\)) chosen so that the total area under the density curve equals one.

  • When \(\alpha\) is small, the gamma density is strongly skewed to the right.

    • For example, when \(\alpha = 1\), the density is a decaying exponential curve, and for \(\alpha < 1\) the density is unbounded near zero.
  • The mean of the distribution is \(\mu = \alpha / \lambda\) and the variance is \(\sigma^2 = \alpha / \lambda^2\) so that the standard deviation is \(\sigma = \sqrt{\alpha}/\lambda\). Here is a graph of the density when \(\alpha = 0.9\) and \(\lambda = 0.09\). The mean is \(\mu = 0.9 / 0.09 = 10\) and the standard deviation is \(\sigma = \sqrt{0.9}/0.09 \doteq 10.54\).

Problem

  • Suppose that \(X_1, \ldots, X_4 \overset{\text{iid}}{\sim} \text{Gamma}(0.9, 0.09)\)
  • Estimate the probability that the sample mean is larger than 20.

Simulation

  • Let’s first solve by simulation.
  • We can use the rgamma() function to generate random gamma variables and use map_dbl() as earlier to take sample means.
B = 1000000
prob5_df = tibble(
  xbar = map_dbl(1:B, ~mean(rgamma(4, 0.9, 0.09))))
  • Plot the density
  • Add a normal curve
  • Theory tells us that \(\mathsf{E}(\bar{X}) = 10\) and \(\mathsf{SD}(\bar{X}) = \frac{\sqrt{0.9}/0.09}{\sqrt{4}} \doteq 5.27\)
alpha = 0.9
lambda = 0.09
mu = alpha/lambda
sigma = sqrt(alpha) / lambda

ggplot(prob5_df, aes(x=xbar)) +
  geom_density(color = "blue") +
  geom_norm_density(mu, sigma/2, color = "red") +
  geom_norm_fill(mu, sigma/2, a = 20, fill = "red") +
  geom_hline(yintercept = 0) +
  xlab("Sample Mean") +
  ylab("Density") +
  ggtitle("Skewed Gamma Distribution Sample Mean, n=4")

  • We can see that the simulated density (blue) is not well approximated by the normal curve with the theoretical mean and standard deviation.
  • The true probability will be much larger than what we would calculate using the central limit theorem.

Simulation-based probability calculation

prob5 = prob5_df %>% 
  summarize(p = mean(xbar > 20))

prob5
## # A tibble: 1 × 1
##        p
##    <dbl>
## 1 0.0491

Central Limit Theorem

mu
## [1] 10
sigma
## [1] 10.54093
1 - pnorm(20, mu, sigma/sqrt(4))
## [1] 0.02888979
  • The CLT value is less than half the simulation-based calculation.

True Value

  • Using theory beyond the scope of this course, we can calculate the exact probability.
prob5_exact = 1 - pgamma(20, 4*alpha, 4*lambda)
  • The simulated value 0.0491 is very close to the true theoretical value 0.0493.

Simulation can be very accurate without any knowledge of the theory to make exact calculations and without knowing the true mean and standard deviation.

  • But, the CLT is handy

When the sample size \(n\) is large enough for a given problem, a simple area under a normal curve can be very accurate.

Example 6

  • Here is another example with \(n=4\) where the central limit theorem is accurate.
  • We use the gamma distribution again, but set \(\alpha = 9\) and \(\lambda = 0.9\) so that the mean \(\mu = 10\) again and \(\sigma = \sqrt{9}/0.9 \doteq 3.33\).
  • Find \(\mathsf{P}(\bar{X} > 12)\).

Simulation

alpha = 9
lambda = 0.9
mu = alpha/lambda
sigma = sqrt(alpha) / lambda

B = 1000000
prob6_df = tibble(
  xbar = map_dbl(1:B, ~mean(rgamma(4, alpha, lambda))))
  • Plot the density
  • Add a normal curve with the same mean and sd

ggplot(prob6_df, aes(x=xbar)) +
  geom_norm_density(mu, sigma/2, color = "red") +
  geom_norm_fill(mu, sigma/2, a = 12, fill = "red") +
  geom_density(color = "blue") +
  geom_hline(yintercept = 0) +
  xlab("Sample Mean") +
  ylab("Density") +
  ggtitle("More Symmetric Gamma Distribution, Sample Mean, n=4")

  • We can see that the simulated density (blue) is fairly well approximated by the normal curve with the theoretical mean and standard deviation, even if not perfectly, and with only \(n=4\)!
  • The graph shows in which intervals the areas are nearly the same and where they may differ by a bit.
  • For a moderately large \(n\), such as \(n=40\), the normal approximation would have been much more accurate for nearly any interval.

Comparison

prob6 = prob6_df %>% 
  summarize(simulation = mean(xbar > 12),
            clt = 1 - pnorm(12, mu, sigma/sqrt(4)),
            exact = 1 - pgamma(12, 4*alpha, 4*lambda))

prob6
## # A tibble: 1 × 3
##   simulation   clt exact
##        <dbl> <dbl> <dbl>
## 1      0.119 0.115 0.118